Clinical machine learning systems that behave inconsistently across hospitals are hard to trust, and harder still to govern. This paper takes the position that explanation stability under distribution shift is not a secondary concern—it is a core requirement for federated learning in healthcare. We present Shift-Stable Explanations via Aggregation (SSE-AGG), a server-side framework that makes explanation consistency a first-class objective during training. Rather than treating all client updates equally, SSE-AGG clusters clients based on their model updates and then steers aggregation toward those whose feature-importance sketches—computed via permutation importance over a small shared reference set—agree with the cohort consensus. The method requires no data sharing, no additional libraries, and no changes to client-side code, making it straightforward to layer on top of existing FL pipelines. It works with FedAvg, FedProx, and SCAFFOLD. We define and measure three explanation-stability metrics: pairwise L? drift across clients, consensus drift across rounds, and top-k Jaccard overlap. Experiments on controlled non-IID partitions show that SSE-AGG meaningfully reduces explanation drift—especially under high heterogeneity—without degrading predictive accuracy. Taken together, the results suggest that explanation-aware aggregation is both feasible and practically useful as a step toward more interpretable and governable federated clinical systems.
Introduction
This paper presents a federated learning framework for clinical machine learning that targets not only predictive accuracy under non-IID data across hospitals, but also a less explored problem: the stability of model explanations.
In healthcare federated learning, models often generalize poorly across hospitals due to differences in patient populations and clinical practices. This leads to distribution shift, where even accurate global models can fail at specific sites. Beyond accuracy, the paper highlights another critical issue: explanation drift, where different hospitals rely on different features (e.g., age vs. oxygen saturation) for similar predictions, making clinical trust and auditing difficult.
To address this, the work proposes an explanation-aware federated aggregation method (SSE-AGG). Each client computes lightweight feature-importance “sketches” using permutation importance. The server then compares these explanations across clients and assigns higher weights to clients whose reasoning aligns with the global consensus, while down-weighting outliers. This is combined with additional objectives for:
worst-case client performance,
prediction calibration,
and standard data-size weighting.
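The sketch-and-reweight step described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the paper's definition: the function names (`importance_sketch`, `sse_weights`), the use of mean absolute prediction change as a label-free permutation score, cosine similarity to a coordinate-wise median consensus, and the temperature `beta`. The worst-case performance and calibration terms are omitted for brevity.

```python
import math
import random
from statistics import median


def importance_sketch(predict, reference, seed=0):
    """Label-free permutation importance on a shared reference set.

    Assumption: the importance of feature j is the mean absolute change in
    the model's prediction when column j is shuffled (no labels required).
    """
    rng = random.Random(seed)
    baseline = [predict(row) for row in reference]
    scores = []
    for j in range(len(reference[0])):
        column = [row[j] for row in reference]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(reference, column)]
        change = sum(abs(predict(p) - b) for p, b in zip(permuted, baseline))
        scores.append(change / len(reference))
    total = sum(scores) or 1.0  # avoid division by zero for a constant model
    return [s / total for s in scores]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sse_weights(sketches, sizes, beta=5.0):
    """Aggregation weights: data size scaled by agreement with the consensus.

    Assumption: consensus is the coordinate-wise median sketch, and agreement
    enters through exp(beta * cosine); beta is a hypothetical temperature.
    """
    consensus = [median(col) for col in zip(*sketches)]
    raw = [n * math.exp(beta * cosine(s, consensus)) for s, n in zip(sketches, sizes)]
    total = sum(raw)
    return [r / total for r in raw]
```

With three equally sized clients, two of which produce similar sketches while the third emphasizes a different feature, `sse_weights` assigns the outlier the smallest weight. The resulting weights would replace the plain data-size weights in the server's averaging step.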
The framework also includes cohort-based clustering, grouping similar clients before aggregation to preserve legitimate heterogeneity while still enforcing explanation consistency within groups.
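One way to picture the cluster-then-aggregate idea is the short Python sketch below. The clustering rule is not specified in this section, so the greedy cosine-threshold grouping, the `threshold` value, and the helper names (`cluster_by_update`, `cohort_average`) are all hypothetical stand-ins. Within each cohort the example uses plain data-size weights where the full method would use explanation-aware weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_by_update(updates, threshold=0.5):
    """Greedy grouping of flattened client updates (an illustrative assumption).

    A client joins the first cohort whose leader update has cosine similarity
    of at least `threshold`; otherwise it starts a new cohort.
    """
    leaders, cohorts = [], []
    for idx, update in enumerate(updates):
        for c, leader in enumerate(leaders):
            if cosine(update, leader) >= threshold:
                cohorts[c].append(idx)
                break
        else:
            leaders.append(update)
            cohorts.append([idx])
    return cohorts


def cohort_average(updates, weights, cohort):
    """Weighted average of the updates belonging to one cohort."""
    total = sum(weights[i] for i in cohort)
    dim = len(updates[0])
    return [sum(weights[i] * updates[i][d] for i in cohort) / total for d in range(dim)]


# Usage: four clients whose updates point in two opposing directions.
updates = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.1]]
sizes = [100, 300, 200, 200]
cohorts = cluster_by_update(updates)
per_cohort = [cohort_average(updates, sizes, c) for c in cohorts]
```

Averaging within cohorts keeps the two opposing update directions from cancelling each other out, which is the sense in which grouping preserves legitimate heterogeneity.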
Experiments on tabular clinical data and medical imaging show that the method improves explanation stability across clients and over training rounds, while maintaining predictive performance comparable to standard federated learning methods.
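For concreteness, the three stability metrics named in the abstract (pairwise drift across clients, consensus drift across rounds, and top-k Jaccard overlap) could be computed along the following lines. This is a minimal sketch under stated assumptions: the order of the norm is left as a parameter `p`, the consensus is taken to be the mean sketch, and all function names are illustrative, not the paper's.

```python
from itertools import combinations


def lp_distance(u, v, p=1):
    """L_p distance between two importance sketches (p is an assumed parameter)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def pairwise_drift(sketches, p=1):
    """Mean L_p distance over all pairs of client sketches in one round."""
    pairs = list(combinations(sketches, 2))
    return sum(lp_distance(u, v, p) for u, v in pairs) / len(pairs)


def mean_sketch(sketches):
    """Coordinate-wise mean sketch, used here as the consensus (an assumption)."""
    return [sum(col) / len(col) for col in zip(*sketches)]


def consensus_drift(rounds, p=1):
    """Mean L_p distance between the consensus sketches of consecutive rounds."""
    consensus = [mean_sketch(r) for r in rounds]
    steps = [lp_distance(a, b, p) for a, b in zip(consensus, consensus[1:])]
    return sum(steps) / len(steps)


def topk_jaccard(u, v, k=3):
    """Jaccard overlap between the top-k feature index sets of two sketches."""
    def top(sketch):
        order = sorted(range(len(sketch)), key=lambda i: sketch[i], reverse=True)
        return set(order[:k])
    a, b = top(u), top(v)
    return len(a & b) / len(a | b)
```

Lower pairwise and consensus drift, and higher top-k overlap, indicate more stable explanations. `pairwise_drift` needs at least two clients and `consensus_drift` at least two rounds.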
Conclusion
The core argument of this paper is that explanation stability should be treated as a system property that can be measured and optimized, not a soft desideratum that is evaluated post-hoc. SSE-AGG operationalizes this by building explanation agreement directly into the server aggregation step, using permutation importance sketches computed on a shared unlabeled reference set. The results show that doing so reduces cross-client and round-to-round drift by 20–40% under high non-IID heterogeneity, with no meaningful cost to predictive performance. The hierarchical Cluster-then-SSE variant extends this to settings where client populations are legitimately diverse, preserving real clinical variation while stabilizing within-cohort explanations. From a practical standpoint, SSE-AGG requires no changes to client-side code, no labeled probes, and no special libraries—it slots in as a modified aggregation step and is compatible with FedAvg, FedProx, and SCAFFOLD. We hope it contributes to a broader shift in how federated clinical systems are evaluated: not only by aggregate accuracy, but by consistency, calibration, and the interpretability of what the model has actually learned.
References
[1] T. Li, F. Qiao, M. Ma, and X. Peng, “Are data-driven explanations robust against out-of-distribution data?” in Proc. IEEE/CVF CVPR, 2023, pp. 3821–3831.
[2] T. A. Lasko, E. V. Strobl, and W. W. Stead, “Why do probabilistic clinical models fail to transport between sites,” npj Digital Medicine, vol. 7, no. 1, 2024.
[3] A. Avati, M. Seneviratne, E. Xue, Z. Xu, B. Lakshminarayanan, and A. M. Dai, “BEDS-Bench: Behavior of EHR-models under distributional shift—a benchmark study,” arXiv:2107.08189, 2021.
[4] W. Yi, H. Zhang, J. Yu, X. Wang, and H. Li, “A trust-based federated learning scheme for collaborative learning across edge,” IEEE Internet of Things J., vol. 9, no. 19, pp. 18652–18662, 2022.
[5] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” arXiv:1812.06127, 2018.
[6] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” arXiv:1910.06378, 2019.
[7] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” arXiv:1910.01991, 2019.
[8] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” in Proc. NeurIPS, 2020.
[9] C. Fung, C. J. M. Yoon, and I. Beschastnikh, “Mitigating Sybils in federated learning poisoning,” arXiv:1808.04866, 2018.
[10] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Proc. NeurIPS, 2017.
[11] A. Altmann, L. Toloşi, O. Sander, and T. Lengauer, “Permutation importance: A corrected feature importance measure,” Bioinformatics, vol. 26, no. 10, pp. 1340–1347, 2010.
[12] A. Fisher, C. Rudin, and F. Dominici, “Model class reliance: Variable importance measures for any machine learning model class, from the ‘Rashomon’ perspective,” J. Mach. Learn. Res., vol. 20, no. 141, pp. 1–81, 2019.
[13] S. Li, E. C.-H. Ngai, and T. Voigt, “An experimental study of Byzantine-robust aggregation schemes in federated learning,” arXiv:2302.07173, 2023.
[14] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proc. ICML, vol. 80, pp. 5650–5659, 2018.